Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Migrated from Arc<dyn Array> to Box<dyn Array> #1042

Merged
merged 1 commit into from
Jun 12, 2022
Merged

Migrated from Arc<dyn Array> to Box<dyn Array> #1042

merged 1 commit into from
Jun 12, 2022

Conversation

jorgecarleitao
Copy link
Owner

This PR is a backward incompatible change to extend the support for clone-on-write semantics
initially added by @ritchie46 on #794 to arrays.

The migration is simple - replace Arc<dyn Array> by Box<dyn Array>.

Background

Arc main goal is to enable sharing of data, typically because cloning is expensive. The tradeoff is that Arcs usually represent immutable data.

All our arrays are quite small structs. Their underlying data is usually a DataType and one or two Arcs that store the data itself.

Thus, Arcing Array does not add much value. OTOH, it makes the API clumsy, since we need to remember to arc everything. It also aludes users to the immutable nature of the Array. Furthermore, it makes it challenging to offer a good support for clone-on-write semantics to arrays since we need to check if the array is being shared, and clone its internals if yes.

Currently,

  • all our nested/child arrays are stored as Arc<dyn Array>
  • all our IO interfaces, including FFI, received and output an Arc<dyn Array>

My understanding is that there three main reasons for us to be in this situation:

  1. Rust support for cloning Box<dyn T> where T: Clone is not great - it is essentially not possible in std.
  2. historically arrow-rs used to store everything under Arc (e.g. ArrayData, Array, data itself).
  3. The C++ implementation, which arrow-rs is inspired on, seems to use Arc (shared_ptr)

This PR

This PR makes

  • child arrays to be Box<dyn Array> instead of Arc<dyn Array>
  • all IO consume and return Box<dyn Array> instead of Arc<dyn Array>

The major benefit of this behavior is that it makes it quite easy to add first class support clone-on-write semantics to the different Array APIs. As an example, with this PR we can now more easily:

  • read a primitive array from parquet (Box<dyn Array>)
  • evaluate a complex runtime expression on it (e.g. ((a * 10) + 2) * exp(-10 * a))) without any extra allocation

This PR also adds an example illustrating how the API is used.

@codecov
Copy link

codecov bot commented Jun 2, 2022

Codecov Report

Merging #1042 (cf2ea2e) into main (d2f5935) will decrease coverage by 0.21%.
The diff coverage is 65.61%.

❗ Current head cf2ea2e differs from pull request most recent head 3404df7. Consider uploading reports for the commit 3404df7 to get more accurate results

@@            Coverage Diff             @@
##             main    #1042      +/-   ##
==========================================
- Coverage   81.31%   81.09%   -0.22%     
==========================================
  Files         365      363       -2     
  Lines       34902    34723     -179     
==========================================
- Hits        28380    28160     -220     
- Misses       6522     6563      +41     
Impacted Files Coverage Δ
src/array/binary/mod.rs 88.79% <0.00%> (-1.12%) ⬇️
src/array/boolean/mod.rs 85.82% <0.00%> (-2.10%) ⬇️
src/array/equal/mod.rs 81.34% <0.00%> (-1.87%) ⬇️
src/array/fixed_size_binary/mod.rs 87.62% <0.00%> (-1.38%) ⬇️
src/array/map/ffi.rs 0.00% <0.00%> (ø)
src/array/mod.rs 69.35% <0.00%> (-2.42%) ⬇️
src/array/null.rs 45.67% <0.00%> (-1.76%) ⬇️
src/array/union/ffi.rs 0.00% <0.00%> (ø)
src/array/utf8/mod.rs 79.39% <0.00%> (-0.73%) ⬇️
src/buffer/immutable.rs 90.66% <0.00%> (-5.34%) ⬇️
... and 119 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d2f5935...3404df7. Read the comment docs.

@jorgecarleitao
Copy link
Owner Author

@houqp, @Dandandan, @dbr, @sundy-li would you be available to review / comment / help on this one? We are trying to significantly improve the clone-on-write story, which is a bottleneck in Arrow, but would like to double-check with you that this does no harm.

@jorgecarleitao
Copy link
Owner Author

jorgecarleitao commented Jun 8, 2022

may be of interest to @wesm also - I recall that Voltron was looking around avoiding allocations on compute kernels - the main point is (in C++ notation): since data is already inside a shared_ptr inside ArrayData, we can use unique_ptr<Array> instead of shared_ptr<Array> - cloning the Array is cheap anyways. The advantage is that we can easily access the array's content and thus can mutate the array itself in-place.

In cases where data is not shared (which is often, e.g. coming from parquet, IPC, etc.), this has dramatic effects since it can lead to a allocation-free compute for math.

E.g. expressions like exp(-(arr + 1)) * 2 + 1 (5 ops) can be done without allocations, still leverage SIMD, and do not require emitting new LLVM via e.g. Grandiva.

@jorgecarleitao
Copy link
Owner Author

#1061 motivates this PR (~2x speedup on horizontal ops, ~5-10x on horizontal logic ops)

@dbr
Copy link
Contributor

dbr commented Jun 11, 2022

This change sounds perfectly fine from our point of view - we mostly interact with arrow2 data either in read-only fashion or via FFI (and the changes to FFI seem pretty trivial to handle) - so this doesn't really impact us directly, and if it's faster and generally better, then, 🥳

One thing I don't quite understand is,

Thus, Arcing Array [...] makes it challenging to offer a good support for clone-on-write semantics to arrays since we need to check if the array is being shared, and clone its internals if yes

I thought Arc provides clone-on-write? Specifically Arc::make_mut(...) docs say "If there are other Arc pointers to the same allocation, then make_mut will clone the inner value to a new allocation to ensure unique ownership. This is also referred to as clone-on-write"

@jorgecarleitao
Copy link
Owner Author

@dbr, great question.

Let's consider the example added on this PR with an Arc instead of a Box:

// This example demos how to operate on arrays in-place.
use std::sync::Arc;
use arrow2::array::{Array, PrimitiveArray};

// say we have have received an array
let mut array: Arc<dyn Array> = PrimitiveArray::from_vec(vec![1i32, 2]).arced();

// we can apply a transformation to its values without allocating a new array as follows:
// 1. downcast it to the correct type (known via `array.data_type().to_physical_type()`)
let array = Arc::make_mut(&mut array)
    .as_any_mut()
    .downcast_mut::<PrimitiveArray<i32>>()
    .unwrap();

// 2. call `apply_values` with the function to apply over the values
array.apply_values(|x| x.iter_mut().for_each(|x| *x *= 10));

This unfortunately does not compile because Arc::<T>::make_mut requires T: Clone, but dyn Array is not clonable (since it is not Size). The crate dyn_clone offers a safe implementation of Clone for Box<dyn T> when T: Clone. We could get away with something like

// say we have have received an array
let array: Arc<dyn Array> = PrimitiveArray::from_vec(vec![1i32, 2]).arced();

// we can apply a transformation to its values without allocating a new array as follows:
// 1. clone the internals
let mut array: Box<dyn Array> = arrow2::array::clone(array.as_ref());

// 2. downcast it to the correct type (known via `array.data_type().to_physical_type()`)
let array = array
    .as_any_mut()
    .downcast_mut::<PrimitiveArray<i32>>()
    .unwrap();

// 3. call `apply_values` with the function to apply over the values
array.apply_values(|x| x.iter_mut().for_each(|x| *x *= 10));

but here we need to be careful - if we would have named the variable array1 instead of array, in step 3. both arrays are in scope and share the underlying data (due to clone). Thus, apply_values would result in a clone of O(N), since the values are now shared between the two arrays.

@sundy-li
Copy link
Collaborator

sundy-li commented Jun 11, 2022

Looks great to me. cc @andylokandy @leiysky @b41sh may be interested in it if databend can introduce this into new expression framework.

@jorgecarleitao jorgecarleitao merged commit e0f4cee into main Jun 12, 2022
@jorgecarleitao jorgecarleitao deleted the box branch June 12, 2022 21:07
@jorgecarleitao jorgecarleitao changed the title Proposal: Migrate from Arc<dyn Array> to Box<dyn Array> Migrated from Arc<dyn Array> to Box<dyn Array> Jun 12, 2022
@elferherrera
Copy link
Contributor

elferherrera commented Jun 12, 2022

Silly question about this change. I understood that the use of Arc in arrow was to have a Sync and Send array. Will Box provide the same functionality?

Forget this question, as soon as I finished writing I checked doc.rs and I see that Box does implement Sync and Send

@joshuataylor
Copy link
Contributor

Not sure if changelog is generated at version release time, but the only change I needed to make for my code that relied on chunks was to change Arc -> Box and it worked. 🚀

@@ -134,7 +133,14 @@ fn create_batch(size: usize) -> Result<Chunk> {
})
.collect();

<<<<<<< HEAD
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like merge conflicts were still in the commit. @jorgecarleitao

Copy link
Owner Author

@jorgecarleitao jorgecarleitao Jun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6jiwtn

Fixed in #1069

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants